Old 11-17-2010, 07:19 AM   #1
hashbang#!
Member
 
Registered: Aug 2009
Location: soon to be independent Scotland
Distribution: Debian
Posts: 120

Rep: Reputation: 17
[bash] find | grep optimization


My recent optimization question seems to have generated some interest.

So I thought I'd post my findings on another case for discussion. (I hope this is appropriate on this forum.)

Scenario
A folder contains about 280,000 HTML files, all residing in monthly subfolders and following the same naming convention. IDs are numeric and of varying length.
Code:
data/<yyyy>/<mm>/event_<id>.html
I need to build a file containing just the files' IDs, excluding a particular monthly folder.

First attempt time: 4.6s
Code:
find data/ -path data/1975/01/ -prune -o \( -type f -printf '%f\n' \) | \
grep -E -o -e '[0-9]*'

I then went back to basics, adding the options above to the command one at a time to see where the time was spent.

find data -type f only takes an amazing 0.22s.
find data/ -path data/1975/01/ -prune -type f: 0.45s
find data/ -path data/1975/01/ -prune -o \( -type f -printf '%f\n' \): 0.6s
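(For reference, these timings come from something like:)
Code:
time find data -type f > /dev/null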

Clearly, the post-processing with grep used most of the time.
Optimization 1 time: 3.1s
Code:
find ... | grep -E -o -e '[0-9]+'

Optimization 2 time: 0.7s
Code:
find ... | cut -d '_' -f 2 | cut -d '.' -f 1
This made -printf unnecessary; dropping it brought the time down to 0.61s.
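On a sample path, the two cuts do this (relying on the paths containing no other '_' or '.'):
Code:
echo 'data/1975/02/event_12345.html' | cut -d '_' -f 2 | cut -d '.' -f 1
# prints 12345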

Optimization 3 time: 0.51s
Code:
find data -type f | grep -F -v "data/1975/01/" | cut ... | cut ...

Unfortunately, the IDs vary in length; otherwise I could have done it with a single cut -c.
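If they were fixed-width (say six digits, purely hypothetical), a single character-range cut would have done it, since the 'event_' prefix is six characters:
Code:
find ... -printf '%f\n' | cut -c 7-12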



And just so we have a question here: how come cut is so efficient?

Last edited by hashbang#!; 11-17-2010 at 07:25 AM.
 
Old 11-18-2010, 06:15 PM   #2
tuxdev
Senior Member
 
Registered: Jul 2005
Distribution: Slackware
Posts: 2,012

Rep: Reputation: 115
cut is efficient because it's not very powerful. It can't do nearly the same sort of stuff grep is capable of.

Actually, I personally wouldn't use cut, but rather bash's built-in functionality that does the same sort of thing.
Code:
while read file ; do
   id="${file%.html}"
   id="${id##*_}"
   echo "$id"
done <(find ... | grep ...)
No clue how fast it is, though I suspect it isn't that different from your final version.
 
Old 11-19-2010, 03:59 AM   #3
hashbang#!
Member
 
Registered: Aug 2009
Location: soon to be independent Scotland
Distribution: Debian
Posts: 120

Original Poster
Rep: Reputation: 17
Quote:
Originally Posted by tuxdev View Post
Actually, I personally wouldn't use cut, but rather bash's built-in functionality that does the same sort of thing. [...]
(still need redirection in addition to process substitution)
Code:
while read file ; do
   id="${file%.html}"
   id="${id##*_}"
   echo "$id"
done < <(find ... | grep ...)
Quote:
Originally Posted by tuxdev View Post
No clue how fast it is, though I suspect it isn't that different from your final version.
time: 27s (compared to find | grep | cut | cut @ 0.6s)
I know the double cut looks a bit of a kludge ...
 
Old 11-19-2010, 04:40 AM   #4
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Debian
Posts: 8,578
Blog Entries: 31

Rep: Reputation: 1208
27s or 0.27s?
 
Old 11-19-2010, 04:47 AM   #5
hashbang#!
Member
 
Registered: Aug 2009
Location: soon to be independent Scotland
Distribution: Debian
Posts: 120

Original Poster
Rep: Reputation: 17
Quote:
Originally Posted by catkin View Post
27s or 0.27s?
27s! Otherwise I would have congratulated tuxdev!
 
Old 11-19-2010, 05:06 AM   #6
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Debian
Posts: 8,578
Blog Entries: 31

Rep: Reputation: 1208
My gast is flabbered; I had expected tuxdev's solution to be a little quicker, not almost an order of magnitude slower! Looking for what causes such poor performance, how does this variant perform:
Code:
while read file ; do
   id="${file%.html}"
   id="${id##*_}"
   echo "$id"
done <<< "$(find ... | grep ...)"
 
Old 11-19-2010, 05:51 AM   #7
hashbang#!
Member
 
Registered: Aug 2009
Location: soon to be independent Scotland
Distribution: Debian
Posts: 120

Original Poster
Rep: Reputation: 17
Quote:
Originally Posted by catkin View Post
how does this variant perform:
Code:
while read file ; do
    [...]
done <<< "$(find ... | grep ...)"
31s

I have played around with bash loops in similar situations and found them slower than letting the tools do the recursion/looping internally.
 
Old 11-19-2010, 08:42 AM   #8
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3192
What about taking the grep out and doing something like:
Code:
while IFS='_' read -r _ line
do
    echo "${line%.html}"
done < <(find data/ -path data/1975/01/ -prune -o -type f -regex '.*[0-9]+\.html' -printf "%f\n")
 
Old 11-19-2010, 10:45 AM   #9
hashbang#!
Member
 
Registered: Aug 2009
Location: soon to be independent Scotland
Distribution: Debian
Posts: 120

Original Poster
Rep: Reputation: 17
Quote:
Originally Posted by grail View Post
What about taking the grep out and doing something like:
Code:
while IFS='_' read -r _ line
do
    echo "${line%.html}"
done < <(find data/ -path data/1975/01/ -prune -o -type f -regex '.*[0-9]+\.html' -printf "%f\n")
The find on its own takes 3.86s. With the while loop it takes 27s.

If you look at my original post, I took -prune and -printf out to save time.

find -regex is slower than find | egrep (see post #1, Optimization 1). Also bear in mind that grep -o already produced the desired output (the ID only).
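(Side note: GNU find's -regex matches against the whole path and defaults to emacs-flavour regular expressions; getting egrep-style syntax needs -regextype, along the lines of the following, though I wouldn't expect it to change the speed ranking.)
Code:
find data -regextype posix-extended -type f -regex '.*/event_[0-9]+\.html'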
 
Old 11-19-2010, 10:56 AM   #10
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,782

Rep: Reputation: 2083
Quote:
Originally Posted by catkin View Post
My gast is flabbered; I had expected tuxdev's solution to be a little quicker, not almost an order of magnitude slower! Looking for what causes such poor performance, how does this variant perform:
Come on guys, it's the while loop that's slow. Except for small operations, using external programs is always going to be faster than doing it in bash.

I was surprised that the -prune wasn't faster, but I think it's because there's only one directory to prune, so it doesn't save much. And since -path takes glob patterns, it's slower than grep -F, which takes fixed strings.
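For concreteness, the two exclusion styles being compared (my reconstruction):
Code:
# glob test inside find: a pattern match runs on every path visited
find data -path 'data/1975/01*' -prune -o -type f -print

# fixed-string filter outside find: grep -F does a plain substring search
find data -type f | grep -F -v 'data/1975/01/'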
 
Old 11-19-2010, 01:04 PM   #11
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Debian
Posts: 8,578
Blog Entries: 31

Rep: Reputation: 1208
Quote:
Originally Posted by ntubski View Post
Come on guys, it's the while loop that's slow. Except for small operations, using external programs is always going to be faster than doing it in bash.
Is it the while loop or the read? If the read is the underlying cause, is it that bash doesn't buffer stdin (that would be a strange design choice)?
 
Old 11-19-2010, 02:06 PM   #12
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,782

Rep: Reputation: 2083
Quote:
Originally Posted by catkin View Post
Is it the while loop or the read? If the read is the underlying cause that bash doesn't buffer stdin (that would be a strange design choice)?
Well, looking at read.def, I see
Code:
unbuffered_read = (nchars > 0) || (delim != '\n') || input_is_pipe;
So it appears that read actually isn't buffering. I'm not certain; the code deals with a lot of different options, so it's hard to follow.
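One way to check from the outside (assuming strace is available) is to watch the read() syscalls on pipe input:
Code:
# if read is unbuffered here, this should show a run of single-byte
# read(0, ..., 1) calls, one per character of the first line
printf 'one\ntwo\n' | strace -e trace=read bash -c 'read x'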

However, I don't think it matters either way; here is a while loop without read:
Code:
#!/bin/bash
COUNT=40000 # maybe increase this if your computer is faster than mine.

echo use bash loop
declare -i i=0
time while ((i<COUNT)) ; do ((i++)) ; done

echo ========
echo use seq
time seq $COUNT >/dev/null
For me the while loop takes around 1.3 seconds and the seq takes 0.1 seconds. bash is over 10 times as slow, and that's after I took out the echo statement.
 
Old 11-19-2010, 09:39 PM   #13
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Debian
Posts: 8,578
Blog Entries: 31

Rep: Reputation: 1208
Those are very educational figures, ntubski, and show the importance of measurement instead of blindly accepting received wisdom. For years I have accepted the plausible hypothesis, perhaps once true, that the fork+exec system calls to run an external program are slow relative to in-shell actions.

On my system the output from your script was:
Code:
use bash loop

real    0m0.426s
user    0m0.405s
sys    0m0.010s
========
use seq

real    0m0.071s
user    0m0.041s
sys    0m0.004s
I modified it to also test an equivalent for ((i=0; i<COUNT; i++)) loop and found it ~10% faster than the while loop. Then I added a call to /bin/true in the loop and found the /bin/true calls took ~0.0001 s each, which is only ~100 times longer than each bare loop iteration.

Presumably there is some caching and hashing which means the first call to /bin/true could have taken significantly longer than 0.0001 s.
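The additions looked roughly like this (a sketch, slotted into ntubski's script after the existing tests):
Code:
echo ========
echo use bash for loop
time for ((i=0; i<COUNT; i++)) ; do : ; done

echo ========
echo use /bin/true in while loop
declare -i i=0
time while ((i<COUNT)) ; do ((i++)) ; /bin/true ; done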
 
Old 11-20-2010, 11:41 AM   #14
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,782

Rep: Reputation: 2083
Quote:
Originally Posted by catkin View Post
For years I have accepted the plausible hypothesis, perhaps once true, that the fork+exec system calls to run an external program are slow relative to in-shell actions.
Well they are, but in the example here there are O(n) in-shell actions but O(1) external calls, so for large n the fork+exec overhead is negligible.
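(Taken to its conclusion, the per-line work can live in a single external process too; an untimed sketch, assuming the naming convention from post #1 and no stray '_' or '.' in the paths:)
Code:
find data -type f | awk -F '[_.]' '$0 !~ /^data\/1975\/01\// { print $(NF-1) }'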
 
  

